Database Foundations for Scalable RDF Processing
نویسندگان
چکیده
As more and more data is provided in RDF format, storing huge amounts of RDF data and efficiently processing queries on such data is becoming increasingly important. The first part of the lecture will introduce state-of-the-art techniques for scalably storing and querying RDF with relational systems, including alternatives for storing RDF, efficient index structures, and query optimization techniques. As centralized RDF repositories have limitations in scalability and failure tolerance, decentralized architectures have been proposed. The second part of the lecture will highlight system architectures and strategies for distributed RDF processing. We cover search engines as well as federated query processing, highlight differences to classic federated database systems, and discuss efficient techniques for distributed query processing in general and for RDF data in particular. Moreover, for the last part of this chapter, we argue that extracting knowledge from the Web is an excellent showcase – and potentially one of the biggest challenges – for the scalable management of uncertain data we have seen so far. The third part of the lecture is thus intended to provide a close-up on current approaches and platforms to make reasoning (e.g., in the form of probabilistic inference) with uncertain RDF data scalable to billions of triples. 1 RDF in centralized relational databases The increasing availability and use of RDF-based information in the last decade has led to an increasing need for systems that can store RDF and, more importantly, efficiencly evaluate complex queries over large bodies of RDF data. The database community has developed a large number of systems to satisfy this need, partly reusing and adapting well-established techniques from relational databases [122]. The majority of these systems can be grouped into one of the following three classes: 1. Triple stores that store RDF triples in a single relational table, usually with additional indexes and statistics, 2. vertically partitioned tables that maintain one table for each property, and 3. Schema-specific solutions that store RDF in a number of property tables where several properties are jointly represented. In the following sections, we will describe each of these classes in detail, focusing on two important aspects of these systems: storage and indexing, i.e., how are RDF triples mapped to relational tables and which additional support structures are created; and query processing, i.e., how SPARQL queries are mapped to SQL, which additional operators are introduced, and how efficient execution plans for queries are determined. In addition to these purely relational solutions, a number of specialized RDF systems has been proposed that built on nonrelational technologies, we will briefly discuss some of these systems. Note that we will focus on SPARQL processing, which is not aware of underlying RDF/S or OWL schema and cannot exploit any information about subclasses; this is usually done in an additional layer on top. We will explain especially the different storage variants with the running example from Figure 1, some simple RDF facts from a university scenario. Here, each line corresponds to a fact (triple, statement), with a subject (usually a resource), a property (or predicate), and an object (which can be a resource or a constant). Even though resources are represented by URIs in RDF, we use string constants here for simplicity. A collection of RDF facts can also be represented as a graph. Here, resources (and constants) are nodes, and for each fact , an edge from s to o is added with label p. Figure 2 shows the graph representation for the RDF example from Figure 1. Fig. 1. Running example for RDF data
منابع مشابه
A Scalable RDF Data Processing Framework based on Pig and Hadoop
In order to effectively handle the growing amount of available RDF data, scalable and flexible RDF data processing frameworks are needed. While emerging technologies for Big Data, such as Hadoop-based systems that take advantages of scalable and fault-tolerant distributed processing, based on Google’s distributed file system and MapReduce parallel model, have become available, there are still m...
متن کاملSPARQL Query Processing with Conventional Relational Database Systems
This paper describes an evolution of the 3store RDF storage system, extended to provide a SPARQL query interface and informed by lessons learned in the area of scalable RDF storage.
متن کاملBitMat – Scalable Indexing and Querying of Large RDF Graphs
The growing size of Semantic Web data expressed in the form of Resource Description Framework (RDF) has made it necessary to develop effective ways of storing this data to save space and to query it in a scalable manner. SPARQL – the query language for RDF data – closely follows SQL syntax. As a natural consequence most of the RDF storage and querying engines are based on modern database storag...
متن کاملFrom relational databases to Linked Data: R for the semantic web
Most of the current structured data in the world lives in relational Data bases (RDB). But statisticians and practitioners sooner or later will find themselves dealing with a different kind of structured representations in the Resource Description Framework (RDF). The logical evolution of the current Web of documents into a Web of Data (and ultimately a Semantic Web) requires mapping of vast qu...
متن کاملTitle of dissertation : SCALABLE ONTOLOGY SYSTEMS
Title of dissertation: SCALABLE ONTOLOGY SYSTEMS Octavian Udrea, Doctor of Philosophy, 2008 Dissertation directed by: Professor V.S. Subrahmanian Department of Computer Science Since the adoption of the Resource Description Framework (RDF) by the World Wide Web Consortium (W3C), ontologies have become commonplace as a way to represent both knowledge and data. RDF databases have flexible schemas...
متن کامل